Abstract: Sharing data from various sources and of diverse 
kinds, and fusing them together for sophisticated analytics and 
mash-up applications are emerging trends, and are prerequisites 
for grand visions such as that of cyber-physical systems enabled 
smart cities. Cloud infrastructure can enable such data sharing 
both because it can scale easily to an arbitrary volume of data 
and computation needs on demand, as well as because of natural 
collocation of diverse such data sets within the infrastructure. 
However, in order to convince data owners that their data are 
well protected while being shared among cloud users, the cloud 
platform needs to provide flexible mechanisms for the users to 
express the constraints (access rules) subject to which the data 
should be shared, and likewise, enforce them effectively. We study 
a comprehensive set of practical scenarios where data sharing 
needs to be enforced by methods such as aggregation, windowed 
frame, value constraints, etc., and observe that existing basic 
access control mechanisms do not provide adequate flexibility to 
enable effective data sharing in a secure and controlled manner. 
In this paper, we thus propose a framework for the cloud that 
significantly extends the popular XACML model by integrating 
flexible access control decisions and data access in a seamless 
fashion. We have prototyped the framework and deployed it on a 
commercial cloud environment for experimental runs to test the 
efficacy of our approach and evaluate the performance of the 
implemented prototype. 

Keywords: cloud computing, access control, flexible sharing, 
fine-grained policies, XACML 

I. Introduction 

The emergence of cloud computing in recent years is rapidly 
changing the way businesses, government agencies and 
individuals store and manage their data and workflows. 
Instead of developing and maintaining individual data 
management infrastructures and data sharing mechanisms, 
data owners now leverage cloud services to make their 
data available to users. The fact that data from multiple sources 
now reside in one logical place, i.e., the cloud, makes it much 
easier than ever before to develop large scale applications 
that require data and knowledge from multiple domains and 
sources. These applications could include environmental study, 
city infrastructure planning, disaster monitoring, and many 
more. Before cloud infrastructure existed, a developer building 
such an application would first have to ask individual data 
owners to provide the data, which is likely to involve tedious 
administrative procedures such as signing documents regarding 
the privileges and responsibilities of each party, apart from 
the cumbersome process of actually shipping the data. The 
developer would then have to develop software that works with 
the individual data exchange interfaces/protocols provided by 
different owners to collect and reformat the data before they 
could be fed into the applications for analysis or real-time 
monitoring tasks. 

On the multitenant cloud, such data from diverse sources are 
naturally collocated, making it much easier and much more 
efficient for the application developers to obtain what they 
need for their work. More specifically, the storage and data 
exchange can be handled efficiently by the cloud providers. This 
means data owners need not worry about how to share, but only 
about what to share and with whom. Putting one's proprietary data online 
on the cloud raises concerns regarding data security, privacy 
and ownership. Even if the cloud service provider is trusted, 
and legally obliged (through service level agreements and law 
enforcement) to prevent illegal access of data and information 
leakage, there needs to be meaningful, comprehensive and 
flexible ways for the data owners to express their sharing 
preferences, in a manner which can readily be interpreted and 
enforced by the cloud service provider. This paper discusses 
how this can be achieved. One can further argue how this can 
be realized if the cloud service provider is not even trusted, 
but that is an issue outside the scope of this work, and is part 
of our future work. 

The objective of this work is to propose and showcase a 
framework for sharing data on the cloud. The framework, 
called eXACML, facilitates sharing in an easy-to-use, secure, 
flexible and scalable manner. For security, we use and extend 
XACML [21], the popular XML-based framework for access 
control. XACML has become a standard for specifying and 
enforcing access control policies. It evaluates requests for 
resources against a set of policies and returns a permit or 
deny decision, which does not involve accessing any data. In 
eXACML, we extend XACML to support more fine-grained 
policies as well as to handle data processing. We demonstrate 
eXACML's flexibility by using it in different access control 
scenarios with different levels of granularity. For usability, 
eXACML provides an intuitive, easy-to-use interface for data 
owners and data users to specify and enforce security policies 
and to access shared data. Finally, we carry out experiments to 
evaluate the framework's performance in a cloud-like 
environment, the results of which suggest that eXACML is scalable. 
We motivate our work with scenarios from ongoing work on 
better city planning, specifically related to weather and traffic 
information, and the evaluations are also based on datasets, 
part of which are real, while the rest are synthetic. 

In summary, the main contributions of this work are as 
follows: 

1) We demonstrate the need for secure and flexible data 
sharing with practical examples involving city planning 
and management based on data from weather and traffic 
monitoring stations. We discuss scenarios in which access 
control with different levels of granularity of data 
access is needed. 

2) We extend the XACML framework to support fine-grained 
policies. In particular, fine-grained access control 
policies (which require data filtering) are expressed 
within obligations that are passed from the Policy 
Decision Point (PDP) to the Policy Enforcement Point (PEP), 
which connects to the database and processes the data 
queries embedded in the obligations. We refer to this 
implementation as XACML*. We discuss why this approach 
could perform better than the traditional approach based 
on views. 

3) We implement a prototype of the framework 
(eXACML), which additionally provides an easy-to-use user 
interface. The prototype allows data owners to easily add 
and modify their policies. Data users can query metadata 
and details of access policies at remote servers. 
They can also request aggregated data from multiple 
sources in a single request. Responses to data requests 
contain information about the matching policies, enabling 
flexible conflict resolution. 

4) We evaluate the performance of our prototype in cloud-like 
settings. Our experiments illustrate that the framework 
incurs low overhead. We attribute this scalability to 
the framework's ability to cache responses and perform 
aggregation of responses from multiple sources prior to 
returning them to the data users. 

The rest of this paper is organized as follows: Section 2 
describes practical scenarios that motivate our framework. 
Section 3 details our extensions to XACML, followed by the 
logical design of our framework in Section 4. The prototype 
and its evaluation are presented in Section 5. We discuss 
related and future work in Section 6 and Section 7 and 
conclude in Section 8. 

Before proceeding further, we would like to make a final note 
on the scope of the current work and implementation. Broadly 
speaking, there are two kinds of data: data already stored 
in the system (which we refer to as archived/archival data), 
and data streams, where live data is flowing into the system. 
Likewise, the queries could be 'on demand', typically on the 
stored data, or continuous queries, to be evaluated on the 
incoming data streams. The current implementation deals with 
on demand queries on stored data. This is summarized in Table I. 

II. Motivating Example 

As an increasing portion of the world population is rapidly 
moving to the cities, while the resources at our disposal 
are shrinking at an alarming rate, numerous research and 
industrial initiatives (e.g., IBM's smart cities initiative^1) are 
focusing on realizing what are being termed 'smart(er) 
cities' in order to manage resources efficiently at the city 
scale. Enabling such a move towards smarter cities are 
cyber-physical systems that aggregate data and actuate the 
necessary resource management actions at the edge, while the 
necessary data storage and analytics are carried out on a 
cloud-based back-end. 

In this section, we use some scenarios of road congestion 
analysis to showcase the need among data owners for flexible 
data sharing. 

A. Settings. 

Noticing that one of the major expressways in the city suf- 
fers serious congestion during every monsoon season, Singa- 
pore's Land Transport Authority (LTA) has, after preliminary 
studies, hypothesized that such congestion is mainly caused 
by three factors, (1) large number of vehicles on the road, (2) 
slow speed of vehicles, (3) bad weather. 

To validate such preliminary conclusions and build a traf- 
fic condition model during the monsoon season, researchers 
need more data. Fortunately, many organizations have been 
collecting related data: LTA itself has a number of sensors 
deployed along the road side to record traffic volume, i.e., 
the number of vehicles passing by at unit time; furthermore, 
another independent entity, a large local taxi company, collects 
the speed and location data from their taxis' GPS devices. At 
almost any time, there are a number of such taxis running over 
the whole stretch of the express way. Likewise, the national 
environmental agency (NEA) has several weather stations 
deployed close to the congested areas, that record weather 
parameters such as temperature, humidity, rain rate, etc. 

If all these different data owners use a shared cloud 
infrastructure^2 to store and process the above-mentioned datasets 
for their individual needs, then when complex analytics 
involving multiple such datasets become necessary, the data is 
readily available on the infrastructure thanks to such 
collocation on the multi-tenant cloud. 

Suppose the data are stored in relational tables as shown in 
Table II for traffic volume information, Table III for cabs' 
location and speed information and Table IV for weather 
information. 

1 http://www.ibm.com/smarterplanet/us/en/smarter_cities/overview/index. 



B. Example 1 

Suppose that NEA decides to share (possibly for a price) 
only the rain rate data with LTA researchers, since other 
weather parameters such as temperature and humidity are not 
expected to affect traffic conditions as much as rainfall does 
in the context of Singapore, and hence LTA does not want to 
pay for the temperature or humidity information. Furthermore, 
even if the originally collected data available with NEA is at 
one-minute intervals, it may want to expose only the data 
corresponding to five-minute averages to LTA. It may also 
expose the more detailed data to its own employees or to other 
customers. 

The first constraint corresponds to the projection operation 
in the relational database model and a sample SQL query will 

2 Note that we are unaware of the current practice of the individual 
organizations mentioned above, and what follows is a hypothetical scenario. 
Query/Database      Archival (relational) databases   Stream databases
On demand query     current implementation            n/a
Continuous query    n/a                               Future work

TABLE I: Scope of eXACML, regarding database and query type 



SamplingTime          Traffic Volume
2011-06-06 10:00:00   60
2011-06-06 10:05:00   67
2011-06-06 10:10:00   50

TABLE II: Table TrafficInfo: Traffic volume data from road side sensors 



SamplingTime          Speed (km/hr)   latitude   longitude
2011-06-06 10:00:00   100             x1         y1
2011-06-06 10:05:00   80              x2         y2
2011-06-06 10:10:00   40              x3         y3

TABLE III: Table VehicleInfo: Vehicle speed and location data from GPS devices 



SamplingTime          Temperature (C)   Humidity (%)   RainRate (mm/hr)
2011-06-06 10:00:00   27.2              70             0.0
2011-06-06 10:01:00   27.5              70             0.0
2011-06-06 10:02:00   27.5              73             0.0
2011-06-06 10:03:00   27.4              72             0.0
2011-06-06 10:04:00   27.3              75             0.0
2011-06-06 10:05:00   27.3              76             0.0
2011-06-06 10:06:00   27.0              77             0.1
2011-06-06 10:07:00   27.1              80             5.0
2011-06-06 10:08:00   26.8              81             14.0
2011-06-06 10:09:00   26.6              82             20.0
2011-06-06 10:10:00   26.5              85             34.4

TABLE IV: Table WeatherInfo: Weather data from weather stations 



be something like "select RainRate from WeatherInfo". The 
second constraint can be considered as a sliding window query 
over a data stream, i.e., the time series rain rate data. Standard 
SQL does not support this kind of query well, hence 
additional operations need to be implemented on top of the 
RDBMS query engine. To specify a sliding window query on 
a time series data sequence in our scenario, five parameters are 
needed, namely, the starting time, ending time, window size, 
window advance step and aggregation function. The starting 
time and ending time are the general temporal constraints that 
specify the segment of the data stream to be returned. The 
window size and window advance step decide the length of 
the query window and how fast the window is moving along 
the data stream. The aggregation function includes numerical 
functions such as average(), max(), min(), count(), etc., which 
are applied to the data records to summarize the portion of 
the data stream within the window. 
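
As an illustration of the five parameters, the following minimal Python sketch applies a sliding window average to the rain rate column of Table IV. The function name sliding_window, the half-open window boundaries [t, t + size) and the choice to skip empty windows are assumptions made for this example; they are not taken from the eXACML implementation.

```python
from datetime import datetime, timedelta
from statistics import mean


def sliding_window(records, start, end, size, step, agg=mean):
    """Aggregate (timestamp, value) records over windows [t, t + size).

    Windows begin at `start`, advance by `step`, and must fit entirely
    within [start, end]. Windows that contain no records are skipped.
    """
    results = []
    t = start
    while t + size <= end:
        values = [v for ts, v in records if t <= ts < t + size]
        if values:
            results.append((t, agg(values)))
        t += step
    return results


# Rain rate readings (mm/hr) from Table IV, one per minute from 10:00:00.
base = datetime(2011, 6, 6, 10, 0)
rain = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 5.0, 14.0, 20.0, 34.4]
records = [(base + timedelta(minutes=i), r) for i, r in enumerate(rain)]

# Five-minute averages, advancing in five-minute steps.
averages = sliding_window(
    records,
    start=base,
    end=base + timedelta(minutes=10),
    size=timedelta(minutes=5),
    step=timedelta(minutes=5),
)
```

With these settings the sketch yields two windows, starting at 10:00:00 and 10:05:00, so the data user sees two averaged values instead of the ten underlying one-minute readings.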

C. Example 2 

Consider that the taxi company agrees to help the 
researchers by providing their taxis' location and speed data, but 
the company only wants to share such information for taxis 
within some specific regions in the vicinity of the congested 
areas being studied, instead of exposing the information about 
its whole fleet, which it deems an important business secret not 
to be exposed to third parties. To enforce such a constraint, 
a selection operator is applied to the longitude and latitude 
columns to filter out those records that are not supposed to be 
shared with the researchers. For the sake of simplicity, assume 
that this range is specified by a rectangle with the geographical 
coordinates of the upper left vertex as (a1, b1) and of the lower 
right vertex as (a2, b2). We then have the corresponding SQL 
query: select SamplingTime, Speed from VehicleInfo v where 
v.longitude >= a1 and v.longitude <= a2 and v.latitude >= 
b2 and v.latitude <= b1. 
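
The selection can be tried out on a small in-memory table. The sketch below uses Python's standard sqlite3 module; the coordinate values and the rectangle bounds are made up for illustration (Table III only gives symbolic positions), and the ORDER BY clause is added here only to make the result order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE VehicleInfo "
    "(SamplingTime TEXT, Speed REAL, latitude REAL, longitude REAL)"
)
# Hypothetical positions; the source table does not give concrete values.
conn.executemany(
    "INSERT INTO VehicleInfo VALUES (?, ?, ?, ?)",
    [
        ("2011-06-06 10:00:00", 100, 1.30, 103.80),  # inside the rectangle
        ("2011-06-06 10:05:00", 80, 1.45, 103.95),   # outside the rectangle
        ("2011-06-06 10:10:00", 40, 1.32, 103.85),   # inside the rectangle
    ],
)

# Rectangle corners as (longitude, latitude), following the text:
# upper left (a1, b1) and lower right (a2, b2).
a1, b1 = 103.75, 1.35
a2, b2 = 103.90, 1.25

rows = conn.execute(
    "SELECT SamplingTime, Speed FROM VehicleInfo v "
    "WHERE v.longitude >= ? AND v.longitude <= ? "
    "AND v.latitude >= ? AND v.latitude <= ? "
    "ORDER BY SamplingTime",
    (a1, a2, b2, b1),
).fetchall()
```

Only the first and third records fall inside the rectangle, so only their sampling times and speeds are returned; the position columns themselves are never exposed.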



To enable the above access constraints in XACML, we make 
use of the obligation element in the policy element to specify 
the constraints. Fig. 4 and Fig. 5 present two examples of XACML 
obligations that embed these constraints. In Fig. 4, line 2 
indicates that permission to perform the sliding window 
query is granted if the decision returned from the PDP is 
'permit'. Line 3 indicates that the aggregation function to be 
used in the sliding window query is the average. Lines 5 to 8 
specify that the starting time is 00:00 on June 6th, 2011, the 
ending time is 00:00 on June 7th, 2011, the window size is 
5 minutes and the window advance step is also 5 minutes. Line 9 
indicates that the sliding window is applied on the SamplingTime 
column as well, besides the actual rain rate data column, which 
is not shown here within the obligation part. Line 3 in Fig. 5 
shows the selection predicate to be included in the SQL query 
to be evaluated on the data table, which only allows vehicle 
information to be returned if the vehicle's location is within a 
given boundary. 

D. Fine-grained Policies 

The examples above demonstrate real needs for an access 
control model that supports fine-grained policies involving 
fine-grained data processing. At a high level, the model needs 
to be able to express and enforce the following types of 
policies: 

1) Aggregated data: Only results of aggregation functions 
over raw data, such as average, sum, min and max, are shared. 

2) Trigger-based: a row of data is accessible only if the 
value of a column satisfies a certain predicate: it exceeds 
a specific threshold, or is contained within a range. 
As an example, a taxi company is granted access to 
temperature readings only if the temperature is over 
30°C. 

3) Sliding window: a sliding window is specified by its 
starting time, ending time, window size and advance 
step. Only aggregated data (average, for instance) over 
the windows are accessible. 

4) Approximation: only data whose values approximate 
those given in the requests are accessible. For example, a 
request includes a value X, and the policy is specified 
such that a row of data is returned only if the value V of 
column c satisfies $|V - X| < \epsilon$ for some distance 
function and threshold $\epsilon$. 

We next explore how such fine-grained policies can be 
flexibly supported. 

III. Flexible Sharing Through Fine-Grained 
Policies 

Existing frameworks, such as XACML, do not natively 
support the different levels of granularity that fine-grained 
access control requires. Nevertheless, XACML has emerged in 
recent years as a mature and widely used model for expressing 
and enforcing access control policies. Therefore, we extend 
XACML in order to support fine-grained policies, including 
those described in Section 2. 

For the rest of this paper, we assume relational databases 
(SQL types) are used for managing data in the back-end. 
Without loss of generality, and for simplicity of 
exposition, we consider that each database consists of a single 
table indexed by time values. When requesting data, the 
user provides their credentials (for example, name and role) and 
specifies the location of the data. The response contains either a 
deny decision (i.e. no access to the data), or a permit decision 
together with the returned data as specified in the policies. 

A. XACML 

XACML is an OASIS framework for specifying and enforcing 
access control [21]. It is XML-based and the latest version 
is 3.0. XACML allows administrators to control their resources 
by writing policy files, which are then loaded into a Policy 
Decision Point (PDP) module. A user wishing to access a 
specific resource sends a request to a Policy Enforcement Point 
(PEP), where the decision is made by consulting the PDP. 
XACML specifies standards for writing policies and requests and 
for interpreting the responses. 



1) Subjects, Resources and Actions. A subject in XACML 
has a set of credentials such as its name, role, etc. The 
subject wishes to perform certain actions (read, write, 
for example) on a set of system resources. 

2) Requests. Requests for accessing system resources are 
written in XML. The subject credentials, system resources 
and actions are specified in one or more Attribute 
elements included in the Subject, Resource and 
Action elements respectively. Fig. 1 shows an example 
of an XACML request from a subject with role admin 
to perform a read action on the temperature column of 
the weather_data database. 

3) Policies. A policy contains a Target, a set of Rules 
each of which has at most one Condition, and a set 
of Obligations. Multiple policies can be grouped into 
a policy set, which has its own Target element. The 
policy is indexed by its Target element, which consists 
of a number of conditions that must be satisfied by the 
request before the rest of the policy can be evaluated. 
Conditions are essentially boolean expressions over the 
values included in the request. The policy returns an 
access control decision which is either Permit, Deny, 
NotApplicable or Indeterminate. The last two are used when 
there is no applicable policy or an error occurred during 
evaluation. Fig. 2 illustrates an example of an XACML 
policy that grants subjects with the government 
role access to the samplingtime and temperature columns of 
weather_data. 

When more than one rule is applicable to a particular 
request, the rules are evaluated according to the rule 
combination algorithm specified in the policy. Similarly, 
multiple applicable policies in a policy set are evaluated 
according to a specified policy combination algorithm. 
Examples of combining algorithms (for both policies 
and rules) are Permit-overrides, where a single Permit 
result takes precedence over the other results, and 
First-applicable, where the result of the first applicable 
policy or rule is used. 

4) Policy Enforcement Point (PEP). User requests first go 
through the PEP, which translates them into canonical 
forms before passing them to the PDP. Additionally, the PEP 
interprets responses and obligations returned from the 
PDP. In summary, the PEP deals with application logic 
and acts as the access control enforcement mechanism. 
Our framework extends the PEP to provide support for more 
fine-grained policies. 

5) Policy Decision Point (PDP). Data owners' policies are 
loaded into the PDP, which evaluates requests received 
from the PEP against the active policies. Its main task is 
to efficiently find applicable policies for a given request 
and to quickly evaluate their rules and conditions to 
determine the access control decision. It sends back to 
PEP a well-formed response containing a decision and 
a set of obligations. 
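
The two combining algorithms mentioned under Policies can be sketched in a few lines of Python. This is a simplified illustration that operates on a list of already-computed decisions and deliberately ignores the Indeterminate outcome, which the XACML standard handles with additional rules; the function names are chosen for this example only.

```python
PERMIT, DENY, NOT_APPLICABLE = "Permit", "Deny", "NotApplicable"


def permit_overrides(decisions):
    """A single Permit wins; otherwise Deny if any rule denied."""
    if PERMIT in decisions:
        return PERMIT
    if DENY in decisions:
        return DENY
    return NOT_APPLICABLE


def first_applicable(decisions):
    """Return the first decision that is not NotApplicable."""
    for decision in decisions:
        if decision != NOT_APPLICABLE:
            return decision
    return NOT_APPLICABLE


# Three rules evaluated in order against the same request.
decisions = [NOT_APPLICABLE, DENY, PERMIT]
combined_permit = permit_overrides(decisions)
combined_first = first_applicable(decisions)
```

For the same list of rule results, Permit-overrides yields Permit while First-applicable yields Deny, which is why the choice of combining algorithm matters when several policies match one request.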
<Subject>
  <Attribute AttributeId="exacml:subject:role-id"
             DataType="http://www.w3.org/2001/XMLSchema#string">
    <AttributeValue>admin</AttributeValue>
  </Attribute>
</Subject>
<Resource>
  <Attribute AttributeId="exacml:rdbms-database-id"
             DataType="http://www.w3.org/2001/XMLSchema#string">
    <AttributeValue>weather_data</AttributeValue>
  </Attribute>
  <Attribute AttributeId="exacml:rdbms-column-id"
             DataType="http://www.w3.org/2001/XMLSchema#string">
    <AttributeValue>temperature</AttributeValue>
  </Attribute>
</Resource>
<Action>
  <Attribute AttributeId="exacml:action-id"
             DataType="http://www.w3.org/2001/XMLSchema#string">
    <AttributeValue>read</AttributeValue>
  </Attribute>
</Action>

Fig. 1: Example of a well-formed XACML request, in which the user with the role admin requests read access to the column 
temperature of the database weather_data 



<Target>
  <Subjects>
    <Subject>
      <SubjectMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">
        <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">
          government
        </AttributeValue>
        <SubjectAttributeDesignator AttributeId="exacml:subject:role-id"
            DataType="http://www.w3.org/2001/XMLSchema#string"/>
      </SubjectMatch>
    </Subject>
  </Subjects>
  <Resources>
    <Resource>
      <ResourceMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">
        <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">
          weather_data
        </AttributeValue>
        <ResourceAttributeDesignator AttributeId="exacml:rdbms-database-id"
            DataType="http://www.w3.org/2001/XMLSchema#string"/>
      </ResourceMatch>
    </Resource>
  </Resources>
  <Actions>
    <AnyAction/>
  </Actions>
</Target>
<Rule RuleId="example" Effect="Permit">
  <Target>
    <Subjects> <AnySubject/> </Subjects>
    <Resources> <AnyResource/> </Resources>
    <Actions> <AnyAction/> </Actions>
  </Target>
  <Condition FunctionId="urn:oasis:names:tc:xacml:1.0:function:string-subset">
    <ResourceAttributeDesignator AttributeId="exacml:rdbms-column-id"
        DataType="http://www.w3.org/2001/XMLSchema#string"/>
    <Apply FunctionId="urn:oasis:names:tc:xacml:1.0:function:string-bag">
      <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">
        samplingtime
      </AttributeValue>
      <AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">
        temperature
      </AttributeValue>
    </Apply>
  </Condition>
</Rule>

Fig. 2: Example of a well-formed XACML policy which grants access to column samplingtime or temperature of the database 
weather_data to any subject with role government 
Fig. 3: Extensions to XACML that support more flexible 
access control policies. 



B. View-Based vs Obligation-Based 

The traditional access control model in relational databases 
is based on views [24]. Basically, a view is the result of an 
SQL query on existing tables, for which read/write access 
is specified. The database management system maintains the 
views and enforces access control rules on them. 

A simple view-based approach to supporting fine-grained 
policies with XACML can be realized as follows. First, views 
are created with no access control restriction and assigned 
unique resource IDs. This can handle all types of policies 
discussed earlier. The PEP maintains a mapping between the IDs 
and the actual views. Next, the IDs are used to specify the 
resources in XACML policies, as well as to construct data 
requests. Once the PDP returns a permit decision, the PEP 
retrieves and returns the corresponding views. 

However, there are a number of weaknesses with this 
approach: 

• Views need to be created prior to policies or requests. 
They must also be removed explicitly by the data owner. 

• Views are static and may be very large in number 
(potentially an infinite number of views for trigger-based 
and sliding window policies). Maintaining these views 
is inefficient at best and impossible at worst. 

• A user requesting data must also maintain a mapping 
of all the view IDs they wish to access. Such a requirement 
is not only undesirable for data users, but also 
expensive to implement. 

Fig. 3 illustrates the obligation-based approach (extensions 
to XACML are highlighted in bold). The basic idea is to embed 
queries for creating views into obligations. The PEP, upon 
receipt of the obligations, executes the embedded queries on 
the database and returns the results in a well-formed response. 
Unlike the view-based approach, the size of data (views) 
maintained by the PEP is bounded. Furthermore, popular queries 
can be cached by the database management system or the 
PEP. In the experiment section, we demonstrate the benefit of 
caching in improving request time. 
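
The flow described above can be pictured with a toy enforcement point. The Python sketch below is only an illustration under simplifying assumptions: the class name PEP, its enforce method and the dictionary cache are hypothetical, the obligation is reduced to a ready-made query string, and cache invalidation on data updates is not handled.

```python
import sqlite3


class PEP:
    """Toy enforcement point that runs the query carried by an obligation."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}  # query text -> previously returned rows

    def enforce(self, decision, obligation_query):
        # Data is only touched after the PDP has returned Permit.
        if decision != "Permit":
            return None
        if obligation_query not in self.cache:
            cursor = self.conn.execute(obligation_query)
            self.cache[obligation_query] = cursor.fetchall()
        return self.cache[obligation_query]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WeatherInfo (SamplingTime TEXT, RainRate REAL)")
conn.executemany(
    "INSERT INTO WeatherInfo VALUES (?, ?)",
    [("2011-06-06 10:08:00", 14.0), ("2011-06-06 10:09:00", 20.0)],
)

pep = PEP(conn)
query = "select avg(RainRate) from WeatherInfo"
rows = pep.enforce("Permit", query)        # executed on the database
rows_again = pep.enforce("Permit", query)  # served from the cache
denied = pep.enforce("Deny", query)        # no data access at all
```

No view is created in advance: the query exists only inside the obligation, and the PEP keeps at most one cached result per distinct query.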

C. Implementations 

1) Obligations: Using the obligation-based approach, policy 
writers utilize different types of obligations to specify different 
database queries. Our current implementation supports four 
types of obligations (Table V): 



Description          ObligationId
Column aggregation   exacml:obligation:column-aggregation
Simple selection     exacml:obligation:simple-selection
Sliding window       exacml:obligation:column-sliding-window
Approximation        exacml:obligation:column-approximation

TABLE V: Obligation types 



1) Column aggregation: consists of a string attribute with 
ID exacml:obligation:aggregation-id. The string 
represents an aggregation function, such as average 
(Fig. 4, lines 2-3), min, max, count or sum. 

2) Simple selection: consists of a string attribute with ID 
exacml:obligation:selection-id. The string is a 
boolean expression that will be used as the WHERE 
clause when constructing the database query. An 
example of this obligation is shown in Fig. 5, in which 
the policy restricts access to data within a certain 
geographical region. 

3) Sliding window: we assume that the column on which 
the sliding windows are based is of type DateTime 
(although sliding windows could be constructed from 
any other sortable type). The obligation consists of a 
number of attributes: 

• Sliding window column: string attribute with ID 
exacml:obligation:sliding-window-column-id 
specifies the column of type DateTime from which 
sliding windows are constructed. 

• Start and End: time attributes with IDs 
exacml:obligation:sliding-window-start-id 
and 
exacml:obligation:sliding-window-end-id 
respectively. 

• Window size: integer attribute with ID 
exacml:obligation:sliding-window-size-id 
specifies the window size (in hours). 

• Advance step: integer attribute with ID 
exacml:obligation:sliding-window-step-id 
specifies how the sliding window advances, i.e. 
the number of hours between the starting times of two 
consecutive windows. 

Fig. 4 (lines 4-10) shows an example of a sliding window 
based on the SamplingTime column. The window's size 
is 5 hours, starting from 2011-06-06 00:00:00, 
advancing in 5-hour steps until 2011-06-07 
00:00:00. 

4) Approximation: this obligation specifies the acceptable 
distance between the column values and 
the values included in the request. Attributes containing 
column IDs are specified in both the requests and the 
policies. Specifically: 

• In the request: string attribute with ID 
exacml:data-value-id is of the form 
<columnId>:<value>, which represents the value 
of the specified column. 

• In the policy: string attribute with ID 
exacml:obligation:approximation-param-id 
contains the column IDs. Columns specified in the 
requests must be a subset of what is specified in 
the policies. Also required is a double attribute 
with ID 
exacml:obligation:approximation-value-id 
which represents the distance between the vector 
of column values in the database and that included 
in the request. 

2) Handling obligations: The PEP extracts attributes embedded 
in the obligations and constructs corresponding queries to be 
executed on the database. It is not uncommon for a policy 
to have more than one type of obligation, which allows for 
more expressive, fine-grained conditions for accessing data. 
Essentially, the PEP creates queries of the following form: 

select f(column_1), f(column_2), ..., f(column_n) 
from Table_name where Where_Condition                    (1) 

where column_i ($1 \le i \le n$) and Table_name are 
extracted from the Resources element of the request. When 
no obligation is returned, f and Where_Condition are set 
to empty strings. In this case, the query becomes: 



select column_1, column_2, ..., column_n 
from Table_name 



The PEP obtains f from the string attribute in the column 
aggregation obligation. When a simple selection obligation is 
returned, Where_Condition is taken directly from its string 
attribute. For approximation obligations, the PEP first retrieves 
a vector of values from the request, namely $(x_1, x_2, \dots, x_k)$ 
for columns $c_1, c_2, \dots, c_k$. It then obtains the distance 
value $\delta$ in the obligation, and sets Where_Condition as: 

$$\sqrt{(c_1 - x_1)\cdot(c_1 - x_1) + \dots + (c_k - x_k)\cdot(c_k - x_k)} < \delta$$
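
The query construction for aggregation, selection and approximation obligations can be sketched as follows. The helper names build_query and approximation_condition are hypothetical, and joining several conditions with AND is an assumption of this sketch; the text only states that Where_Condition is built from these obligations. The sketch assembles plain strings for readability, whereas a production PEP would need to validate or parameterize the values it interpolates.

```python
def approximation_condition(columns, values, delta):
    """Build the Euclidean-distance predicate for an approximation obligation."""
    terms = [f"({c} - {x})*({c} - {x})" for c, x in zip(columns, values)]
    return f"sqrt({' + '.join(terms)}) < {delta}"


def build_query(table, columns, f="", where_conditions=()):
    """Assemble a query of form (1); empty f and conditions give a plain select."""
    select_list = ", ".join(f"{f}({c})" if f else c for c in columns)
    query = f"select {select_list} from {table}"
    conditions = [w for w in where_conditions if w]
    if conditions:
        # Assumption: multiple obligation predicates are combined with AND.
        query += " where " + " AND ".join(f"({w})" for w in conditions)
    return query


# No obligation returned: plain projection of the requested columns.
plain = build_query("WeatherInfo", ["RainRate"])

# Aggregation plus selection plus approximation obligations.
combined = build_query(
    "WeatherInfo",
    ["Temperature", "Humidity"],
    f="avg",
    where_conditions=[
        "Humidity > 75",
        approximation_condition(["Temperature"], [27.0], 0.5),
    ],
)
```

Here plain reproduces the obligation-free query shown earlier, while combined carries both the aggregation function and the two predicates in a single statement.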

Handling sliding-window obligations is more complex. 
First, the tuple (start, end, window_size, advancing_step) is 
extracted from the obligation. The total number of windows is: 

$$nW = \left\lfloor \frac{end - start - window\_size + 1}{advancing\_step} \right\rfloor + 1$$



For every window, the PEP creates a different query. More 
specifically, let c be the column (of type DateTime) from which the 
sliding windows are constructed; query i ($0 \le i < nW$) is 
of the form: 

select f(column_1), f(column_2), ..., f(column_n) 
from Table_name where Where_Condition 
AND c > start+step*i 
AND c < start+step*i+size 

where Where_Condition is constructed from simple 
selection and approximation obligations. 
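
A minimal sketch of this per-window query generation is given below. It treats times as plain integers in a single unit (for example, minutes since the start of the day), which is a simplification of the DateTime handling in the text; the function names window_count and window_queries are hypothetical.

```python
def window_count(start, end, size, step):
    """Number of windows, following the formula in the text (integer units)."""
    return (end - start - size + 1) // step + 1


def window_queries(select_from, where_condition, column, start, end, size, step):
    """Return one query per window i, for 0 <= i < nW."""
    queries = []
    for i in range(window_count(start, end, size, step)):
        lower = start + step * i
        upper = lower + size
        bounds = f"{column} > {lower} AND {column} < {upper}"
        # An empty Where_Condition leaves only the window bounds.
        where = f"{where_condition} AND {bounds}" if where_condition else bounds
        queries.append(f"{select_from} where {where}")
    return queries


# One day in minutes with 5-minute windows and 5-minute steps.
windows_per_day = window_count(0, 1440, 5, 5)

queries = window_queries(
    "select avg(RainRate) from WeatherInfo",
    "",            # no selection or approximation obligation
    "t",           # hypothetical integer time column
    start=0, end=20, size=5, step=5,
)
```

For a full day of 5-minute windows the formula gives 288 windows, and each entry of queries differs from the others only in its pair of window bounds.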

IV. The Logical Framework 

This section presents our design of the framework that
enables secure, easy-to-use, flexible and scalable data sharing.
Security comes from the use of XACML for specifying
and enforcing access control. Flexibility is the
result of our enhancement to XACML, which supports a wider
range of access control policies. Usability and scalability are
achieved through a simple client interface and the use of a
proxy server, whose details are described below.

Fig. 6: eXACML framework. XACML* denotes the extended
XACML described in Section III.

A. Entities 

Fig. 6 illustrates the main entities and how they interact
in our framework. Clients consist of data owners who wish
to share and enforce access control on their datasets, and of
data users who are interested in accessing the data. A data
owner can have more than one dataset and a data user can
request access to multiple datasets. Databases are database
servers which manage clients' datasets. Access to each database
is controlled by at least one instance of XACML* (discussed
below). These servers are likely to be remote and maintained
by a third-party (cloud) provider.

Our framework, eXACML, is positioned between
clients and databases (Fig. 6). Its roles are to mediate their
interactions and to safeguard the databases. Essentially,
eXACML is made up of a client interface, a proxy server, cloud
servers and XACML* instances.

• Clients interact with the databases through a local client
interface that parses inputs into request messages and
forwards them to the proxy server. It waits for and interprets
response messages before returning them to the
clients. This interface hides the complexity of
exchanging well-formed messages with the proxy server
and allows clients to share and query data in an intuitive
manner.

• A cloud server (or simply server), usually located on the same
machine as the databases, accepts and processes client
requests. It manages and responds to meta queries concerning
XACML* instances. For data requests, it forwards them
to the appropriate XACML* instances and sends the
results to the proxy in well-formed messages.
<Obligations>
  <Obligation ObligationId="exacml:obligation:column-aggregation" FulfillOn="Permit">
    <AttributeAssignment AttributeId="exacml:obligation:aggregation-id"
        DataType="http://www.w3.org/2001/XMLSchema#string">
      avg
    </AttributeAssignment>
  </Obligation>
  <Obligation ObligationId="exacml:obligation:column-sliding-window" FulfillOn="Permit">
    <AttributeAssignment AttributeId="exacml:obligation:sliding-window-start-id"
        DataType="http://www.w3.org/2001/XMLSchema#time">
      2011-06-06 00:00:00
    </AttributeAssignment>
    <AttributeAssignment AttributeId="exacml:obligation:sliding-window-end-id"
        DataType="http://www.w3.org/2001/XMLSchema#time">
      2011-06-07 00:00:00
    </AttributeAssignment>
    <AttributeAssignment AttributeId="exacml:obligation:sliding-window-size-id"
        DataType="http://www.w3.org/2001/XMLSchema#integer">
      5
    </AttributeAssignment>
    <AttributeAssignment AttributeId="exacml:obligation:sliding-window-step-id"
        DataType="http://www.w3.org/2001/XMLSchema#integer">
      5
    </AttributeAssignment>
    <AttributeAssignment AttributeId="exacml:obligation:sliding-window-column-id"
        DataType="http://www.w3.org/2001/XMLSchema#string">
      samplingtime
    </AttributeAssignment>
  </Obligation>
</Obligations>



Fig. 4: Obligation portion of the XACML policy for Example II-B 



<Obligations>
  <Obligation ObligationId="exacml:obligation:simple-selection" FulfillOn="Permit">
    <AttributeAssignment AttributeId="exacml:obligation:selection-id"
        DataType="http://www.w3.org/2001/XMLSchema#string">
      longitude >= a1 and longitude <= a2 and latitude >= b2 and latitude <= b1
    </AttributeAssignment>
  </Obligation>
</Obligations>



Fig. 5: Obligation portion of the XACML policy for Example II-C 



• XACML* is an implementation of the extended XACML
model described in Section III (Fig. 3). It processes data
requests (received from the cloud server) by first asking
the PDP for the access decision. If access is permitted, it
executes the obligations, which involves querying the database.
The result is forwarded back to the cloud server.

• Communications between clients and servers go through
a proxy server (or proxy). It processes requests from
clients before forwarding them to the servers, and combines
the results into client response messages. For
example, when a request from a data user requires
data from multiple datasets, the proxy first
creates multiple requests and sends them to the corresponding
servers. It waits for all the responses from the servers, then
combines the results into a single response message for
the data user.

The benefit of having the proxy server is two-fold:

1) Improved performance: Combining data before returning
it to the users reduces communication costs.
Caching at the proxy can also improve response
time and reduce both computation and communication
costs for the database servers. We demonstrate
this effect in the evaluation section.

2) Additional level of abstraction: The proxy server
acts like a DNS service, mapping datasets to
global, easy-to-remember names. This achieves network
data independence, which makes it easier for clients
to manage and query data.
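The way the proxy splits a multi-dataset request, waits for every server and combines the responses can be sketched as follows. This is only an illustration under stated assumptions: fan_out and the send callable are hypothetical names, and the prototype's actual message format and transport are not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(sub_requests, send):
    """Send one sub-request per dataset and collect all responses.

    sub_requests maps a dataID to the request destined for its server;
    send(data_id, request) is a caller-supplied (hypothetical) function
    that contacts that server and returns its response.
    """
    workers = max(1, len(sub_requests))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {d: pool.submit(send, d, r) for d, r in sub_requests.items()}
        # result() blocks, so the proxy waits for all servers before
        # building the single response message for the client.
        return {d: f.result() for d, f in futures.items()}
```

Issuing the sub-requests concurrently is a design choice of this sketch; the text only requires that the proxy waits for all responses before replying.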

B. Trust and Data Model 

We assume cloud servers and the proxy server are honest.
This means that they are trusted to run the correct, latest
eXACML framework and not to violate
data privacy. More specifically, the proxy is trusted not to
tamper with the data received from database servers. The only
adversaries are rogue clients
who can collude in an attempt to gain unauthorized access to the
datasets belonging to honest data owners. We remark that these
assumptions (particularly, that of trusted service providers) are
reasonable since cloud service providers strive to build the
reputation on which their business depends, and furthermore have legal
obligations based on Service Level Agreements [23].

We assume that datasets are managed by relational database 
systems. For simplicity, each data owner has at most one 
dataset. This assumption can be relaxed by virtualizing the 
data owner, so that it has multiple identities, each of which 
possesses a different dataset. 

C. Cloud Model 

We now discuss different ways to connect the database,
XACML* and cloud server components. As seen in Fig. 6, the
number of servers, the number of databases and the number of
XACML* instances do not have to match. In particular, multiple
databases may share the same XACML* instance, while a cloud server
may handle multiple XACML* instances.

Fig. 7: Interaction model of the cloud server, XACML* and
database

A server represents a logical, addressable machine to which 
the proxy connects. One server can handle requests for mul- 
tiple datasets, but we assume each server is connected to one 
dataset. This assumption is reasonable since each data owner 
has at most one dataset, and it is likely that data owners use 
independent virtual machines. 

Next, we consider the question of how XACML* instances
are shared among databases. At one extreme, a single
XACML* instance is sufficient to deal with all access
requests. In this case, the servers connect to the same
XACML* instance, and policies are added to the same PDP.
The PEP has access to multiple databases on different machines.
However, this approach introduces a single point of
failure, and data owners may prefer to have their access
control systems separated from each other. Moreover, extra
layers of authorization are required to prevent rogue clients
from uploading policies associated with datasets of honest
data owners. At the other extreme, the server maintains one
XACML* instance per dataset. Since data requests can be
processed in parallel, this approach could lead to significant
improvement in performance. However, a potential drawback
is the overhead of maintaining a large number of XACML*
instances, especially if many are idle.

When multiple datasets share the same physical machine
(but are in separate virtual machines), it makes more sense
for them to share one XACML* instance. This approach
benefits from the parallelism in processing requests, while
having reduced maintenance overhead. However, a shared
XACML* instance suffers from the same problems as a single
XACML* instance: it is a single point of failure and it
requires an extra layer of authorization.

Considering the above trade-offs, in this paper we
adopt the simple, no-sharing approach, i.e. one server connects
to one XACML* instance that safeguards one database (illustrated
in Fig. 7). This model does not require another layer of
authorization and is therefore easy to implement.



D. One or Multiple Proxies? 

Having multiple proxies addresses the trust problem associated
with a single proxy. It could also improve client
throughput, since requests can be processed in parallel.
However, joining data across multiple proxies (data joining
being one of the proxy's main features) is more complex. Since
proxies also maintain data caches, a mechanism for cache coherence
among distributed servers is also required. Therefore, trade-offs
between efficiency and maintenance overhead must be
carefully considered. Our current framework employs only one
proxy. We defer the protocols with multiple proxies to future
work.

E. Initialization 

In the beginning, a data owner creates a database for its
datasets and initializes an XACML* instance at a remote data
server. The XACML* instance starts with an initial policy
specifying who can add and remove data and policies. This
process is done by invoking

{success, fail}
    <- initDatabase(host, port, dataID,
                    databaseType, credentials)

where host and port are the address of the server, dataID
is the unique identifier of the dataset, databaseType is
the name of the database management system (MySQL, for example),
and credentials consists of the data owner's name,
role and other authentication information for accessing the
server. The client interface wraps these parameters into a
message and sends it to the proxy, which forwards it to the
specified server. After authenticating the data owner, the server
creates the database, starts an XACML* instance and connects its
PEP to the database. Finally, the server uploads a root policy
to the newly created XACML* instance. The root policy
specifies that only users with credentials can add new
data, upload new policies and remove existing policies. This policy
prevents other clients from adding their own policies to this
XACML* instance.

If successful, the proxy creates a new mapping from
dataID to the dataset, as explained next.

F. Data and Policy Management. 

Once a database is initialized successfully, it can be identified
uniquely by its dataID. The proxy maintains a mapping
dataID_to_desc, which is a list of:

dataID : <host, port, database name>

All client requests contain dataIDs. The proxy resolves the
location of each dataset using its mapping, before forming
new requests and forwarding them to the appropriate database
servers.
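The dataID_to_desc mapping can be pictured as a small directory kept by the proxy. The sketch below is illustrative only; the class and method names are assumptions, and rejecting a duplicate dataID is our reading of the uniqueness requirement rather than documented behaviour.

```python
from typing import NamedTuple


class DatasetLocation(NamedTuple):
    host: str
    port: int
    database_name: str


class DataDirectory:
    """Proxy-side mapping dataID -> <host, port, database name>."""

    def __init__(self):
        self._map = {}

    def register(self, data_id, host, port, database_name):
        # Assumption: a dataID is unique, so re-registration is an error.
        if data_id in self._map:
            raise ValueError(f"dataID already registered: {data_id}")
        self._map[data_id] = DatasetLocation(host, port, database_name)

    def resolve(self, data_id):
        """Return the location of a dataset; raises KeyError if unknown."""
        return self._map[data_id]
```

A client therefore only needs the dataID; the host, port and database name stay internal to the proxy.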

a) Adding and removing data.: To add data to or remove
data from a dataset, the data owner invokes

{success, fail}
    <- addData(data file, dataID, credentials)
{success, fail}
    <- removeData(remove query, dataID, credentials)

where data file contains the data to be added to dataID
using the given credentials, and remove query is the
query that removes records from the database. The client interface
sends a request to the proxy, which in turn constructs and
forwards a well-formed XACML request together with the file
hash or query hash to the server. The server keeps the hash as
the pending-add or pending-removal token. Only if the access
control decision is 'permit' does the client interface send
data file or remove query to the server, which verifies
that the content hash matches the pending-add or pending-removal
token before performing the operation. In this protocol, the hash
value prevents other data owners from adding rogue
data or removing data without authorization.
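The server side of this two-step protocol can be sketched as below. The text does not name the hash function, so SHA-256 is an assumption, as are the class and method names; the token is also consumed on first use here, which is a choice of this sketch.

```python
import hashlib


def content_hash(payload: bytes) -> str:
    """Hash of a data file or remove query (SHA-256 is an assumption)."""
    return hashlib.sha256(payload).hexdigest()


class PendingOperations:
    """Tokens for authorized adds or removals, kept by the server."""

    def __init__(self):
        self._pending = set()

    def authorize(self, token: str, decision: str) -> bool:
        # Step 1: the XACML request arrives with the content hash; the
        # hash is kept as a pending token only when the decision is Permit.
        if decision == "Permit":
            self._pending.add(token)
            return True
        return False

    def accept(self, payload: bytes) -> bool:
        # Step 2: the content itself arrives; it is accepted only if its
        # hash matches a pending token, which is then consumed.
        token = content_hash(payload)
        if token in self._pending:
            self._pending.remove(token)
            return True
        return False
```

Content whose hash was never authorized is rejected in step 2, which is how the protocol keeps other parties from injecting data after a permit decision.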

b) Loading and removing policy.: Every loaded policy is
identified uniquely by its ID of the form dataID:policyID,
where policyID is the integer index of the policy. The
XACML* instance maintains an index counter which advances
whenever a new policy is added.

To add or remove a policy, a data owner invokes

{policyID, fail}
    <- loadPolicy(policy file, dataID, credentials)
{success, fail}
    <- removePolicy(policyID, dataID, credentials)

where policy file contains the XACML file to be
uploaded to dataID using the given credentials. The
policy to be removed is identified by the tuple (dataID,
policyID). The client interface forwards a request to
the proxy, which creates a well-formed XACML request
(for loading or removing a policy) using dataID and
credentials. Once it arrives at the server, the request is
evaluated by the appropriate XACML* instance. Only if the
decision is permit is the policy file added to, or the policy
dataID:policyID removed from, the corresponding
PDP. In the case of policy addition, the new policyID (the
current value of the index counter) is returned to the data
owner. We assume that policies are small, so there is no need
for the two-step protocol used for adding and removing data.
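The policy index counter can be sketched as follows. The names are hypothetical, and two details are assumptions of this sketch rather than statements from the text: the counter starts at 0, and IDs of removed policies are never reused.

```python
class PolicyStore:
    """Per-dataset policy bookkeeping of one XACML* instance (sketch)."""

    def __init__(self, data_id):
        self.data_id = data_id
        self._counter = 0          # assumption: indices start at 0
        self._policies = {}

    def load(self, policy_xml):
        """Store a policy and return its policyID (current counter value)."""
        policy_id = self._counter
        self._policies[policy_id] = policy_xml
        self._counter += 1         # the counter advances on every addition
        return policy_id

    def remove(self, policy_id):
        """Remove a policy; returns False if the ID is unknown."""
        return self._policies.pop(policy_id, None) is not None

    def full_id(self, policy_id):
        """Globally unique ID of the form dataID:policyID."""
        return f"{self.data_id}:{policy_id}"
```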

c) Querying policy.: Both the data owner and the server
keep track of the policy IDs associated with the dataset. One
can query the loaded policies for a dataset using

{{(policyID, description)}, fail}
    <- queryPolicy(dataID, credentials)

which returns a set of tuples (policyID,
description), where description is the Description
element of the corresponding policy.

G. Data Request. 

A data user issues a request for data through the client
interface. The request may involve accessing multiple datasets.
The data user knows the dataIDs, but may not know the
detailed structure of the datasets.

1) Querying meta data.: A data user can issue a query for
the dataset's meta data prior to requesting the raw data. Typical
meta data includes table names and schemas. Data owners can
restrict access to such information through a set of policies.
To query meta data, the data user invokes:

{{tableID}, fail}
    <- queryTables(dataID, credentials)
{{(columnID, type)}, fail}
    <- queryColumns(dataID, tableID, credentials)

The proxy translates the client request into a well-formed,
standard XACML request in which the Action attribute
is set to show_table or show_column respectively.
If the PDP returns a permit decision, the PEP retrieves
and returns the database's metadata accordingly. The result
for queryTables (if permitted) is a set of tableIDs,
which can later be used in requesting raw data. The result
for queryColumns is a set of tuples (columnID,
type) representing the column name and type.

2) Querying data.: Clients can request data by invoking:

{{data record}, {matching policies}, fail}
    <- queryData(requested resources, joining condition)

where

requested resources
    = { <credentials, dataID, {columns},
         {actions}, {constraints}> }

represents the resources requested from different datasets, and
joining condition specifies how the results from those
datasets are joined. The results are returned separately if
joining condition is null. constraints contains
conditions that are applied to the returned data. For example,
column_i > 9 where column_i ∈ {columns} indicates that
the request is only for data whose column_i values are greater
than 9. The protocol proceeds as follows:

1) For every requested resource, the proxy creates a well-formed
XACML request using dataID and columns as
Resources and actions as Actions attributes. The
request is then forwarded to the server specified by
dataID.

2) The XACML* instance returns the access control decision,
the accompanying data (if the decision is permit), and the IDs
of the matching policies.

3) The proxy, on receipt of non-empty data, applies the
conditions specified in constraints. Depending on the
value of joining condition, it performs data joining
(discussed next) before sending the final response to the
client.

H. Data Joining. 

The joining condition parameter used in
queryData specifies how the results are joined before
being returned to the client. In particular:

joining condition ∈ {null, {c_1, c_2, .., c_k}}

where k is the number of requested resources and c_i (1 ≤
i ≤ k) are the joining columns of the returned data. When
joining condition = null, the proxy forwards what it
receives from the servers directly back to the client. Otherwise,
it waits until it has data from all requested servers, then
constructs a client response by joining the results using normal
database join operations.
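As an illustration of the proxy-side processing, the sketch below filters returned rows with a client constraint and then joins two result sets on their joining columns. It assumes rows are dictionaries keyed by column name and uses a simple in-memory equi-join (hash join); the text only says "normal database join operations", so the join algorithm and the function names are assumptions.

```python
def apply_constraint(rows, predicate):
    """Keep only the rows satisfying a client constraint."""
    return [row for row in rows if predicate(row)]


def join_results(left, right, left_col, right_col):
    """Equi-join two result sets on left_col = right_col (hash join)."""
    index = {}
    for row in right:
        index.setdefault(row[right_col], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[left_col], []):
            # On a column-name clash the right-hand value wins.
            joined.append({**row, **match})
    return joined
```

With weather rows keyed by t and traffic rows keyed by time, join_results(weather, traffic, "t", "time") keeps only the sampling instants present in both results, which is also why differing policies on the two datasets can produce an empty join.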
I. Conflict Resolution.

It is possible for clients to receive empty data for their
requests, especially when the requests involve more than one
dataset. This arises because different policies associated with
different datasets are enforced. We refer to this as a policy
conflict, which happens in one of two cases:

1) There is at least one policy that denies the client's access.

2) All policies permit access, but the joined data still results
in an empty set. For example, one policy allows access to
data where column_i > 9 whereas another policy allows
access to data where column_i < 9. Another example is
when two policies specify different sliding windows, and
as a consequence the joining columns have no values
in common.

We provide a simple mechanism for dealing with policy
conflicts. Responses from queryData include the IDs of the
matching policies. When a conflict occurs, the client is aware of
the cause and is able to contact the dataset owner to resolve the
conflict. We assume that such resolution is done out-of-band
and is not within the scope of the framework.

J. Caching. 

The proxy maintains a cache of data received from the 
servers. Since operations in the cloud server are slow, espe- 
cially when involving database access, caching can improve 
the response time. It is also reasonable to expect a cache- 
friendly request pattern from clients, as popular data are 
frequently requested. 

We consider a simple design, in which data cache is the 
map <request> : <data> where request is the XACML 
request with the corresponding data. 

• Cache replacement: when full, an old entry is evicted in 
a random fashion. 

• Cache coherence: stale entries can lead to security violations.
For instance, a new policy update denies a client
access to a dataset, but the cache still contains data from a
previous access, which the proxy would serve at the client's
next request. We address this problem by simply purging
the entire cache every time a policy is loaded or removed.
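The cache design above can be sketched as follows. The class is a hypothetical illustration rather than the prototype's code: requests are assumed to be hashable (for example, the serialized XACML request string), and the capacity is assumed to be a positive number of entries.

```python
import random


class ProxyCache:
    """Map <request> : <data> with random replacement and full purge."""

    def __init__(self, capacity, rng=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = {}
        self._rng = rng or random.Random()

    def get(self, request):
        """Return cached data for a request, or None on a miss."""
        return self._entries.get(request)

    def put(self, request, data):
        # Cache replacement: when full, evict an entry chosen at random.
        if request not in self._entries and len(self._entries) >= self.capacity:
            victim = self._rng.choice(list(self._entries))
            del self._entries[victim]
        self._entries[request] = data

    def purge(self):
        # Cache coherence: called whenever a policy is loaded or removed.
        self._entries.clear()

    def __len__(self):
        return len(self._entries)
```

Purging everything is coarse, but it guarantees that no entry authorized under an old policy survives a policy change.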

V. Prototype and Evaluation 

A. Prototype 

We have implemented a prototype of eXACML, which
consists of over 3400 lines of Java code. Database accesses
are provided by the JDBC API, while communications between
clients, proxy and servers are done through the Socket
interface. For XACML*, we extended Sun's XACML
implementation, an open-source Java project that supports
the XACML 2.0 standard. We instrumented its PEP module to
handle more obligations (Section III). The prototype supports
all the features discussed in the previous section: a client is
able to load, remove and query data and policies.

Our prototype provides an easy-to-use graphical interface
for querying and managing data. A query form (Fig. 8(b))
takes in user credentials and requests. A response from the
server includes the data server information, matching policies
and the data (if applicable), which are displayed in the data
view window (Fig. 8(a)). Policies are updated and queried using
a similar GUI, as shown in Fig. 9.

Fig. 8: User interface for querying data. (a) Data view. (b) Query form.

Fig. 9: User interface for managing access control policies.
(a) Policy view. (b) Policy upload.

B. Evaluation 

We evaluated our prototype's performance, and its ability
to support dynamic, fine-grained access control policies. The
system performance is measured by the time taken to fulfill
user requests. We compare our prototype's performance
against that of a system that executes the requests directly,
i.e. without the access control layer. We refer to the latter as
the direct-query system.

1) Methodologies. : 

a) Setup.: We emulate a cloud-like environment running
our prototype, as shown in Fig. 6. More specifically, we make
use of four machines: two running servers, one running the
proxy and one representing a client. The machines belong
to the PDCC cluster (footnote 1); each has one 3.0 GHz Xeon
processor and 4GB of RAM, and runs the OCS5.1 (2.6.18-53E15smp)
operating system. The machines are connected via 20 Gbps
InfiniBand.

The servers maintain two databases: a weather database 
and a traffic database. The former contains four tables with 
real data taken from four different weather stations collected 
in a 5-day duration and with one-minute sampling interval. 
We synthesize the traffic database with two tables containing 
records of traffic volume and vehicle speed that match with 
the weather datasets. 

b) Workloads.: We generate synthetic workloads that
include large numbers of policies and requests. Since our
prototype is compared against a direct-query system, the workloads
also contain a large number of direct database queries,
each corresponding to a request in our prototype. A direct query
is forwarded to the server, which executes it and returns the same
data as when executing the corresponding request in our system.
The parameters used in generating workloads are shown
in Table VI. The workloads and the source code for generating
them are available online (http://sands.sce.ntu.edu.sg/trac/exacml/).
First, we use nDirectQueries and directQueryDist to
create a set DQuery of direct queries of five different
types: selection, approximation, aggregation, sliding window
and data joining. The first three types are ordinary database
SELECT queries, which are forwarded by the server directly to
the database engine. Sliding window queries are first converted
into multiple SELECT queries, one for every window, which
are then sent to the database engine. Data joining queries
contain two sub-queries (of the other four types) chosen at
random and for different data servers. Each data server processes
and returns the result independently. Next, nPolicies
unique XACML policies are generated, each with a different
exacml:subject:role-id. Every policy corresponds to
a direct query whose type is either selection, approximation,
aggregation or sliding window. Therefore, the set of policy
obligations and DQuery represent the same set of SELECT
queries to be executed by the database engines.



Footnote 1: http://pdcc.ntu.edu.sg/content/128-cores-linux-cluster-pdccsce



Next, we generate a set of requests. For every policy, we
construct one matching and one non-matching request. The
matching request contains credentials, resources and actions
as specified in the policy. For the non-matching request, we
use an exacml:rdbms-database-id that differs from the weather
and traffic database names. For each data joining direct query, we
create corresponding (matching and non-matching) requests
made up of two sub-requests. Each sub-request from the
matching request corresponds to a sub-query in the data
joining direct query. In summary, a matching request executed
in our prototype returns the same data as the corresponding
query evaluated in the direct-query system.

Finally, we create a workload of nRequests requests following
a Zipf distribution with skew parameter α. This workload
models a realistic use of the prototype, in which a small
number of popular data items are requested frequently. Such a
request pattern is found in many other systems, such as P2P
file-sharing and web caching [16]. We select maxRank
unique queries from DQuery at random, then assign them
random ranks. A sequence of queries is generated from
the selected set with Zipf distribution, using α = 0.223 (as in
[16]). For every direct query, this workload also contains the
corresponding policy, matching and non-matching request.
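A workload of this shape could be generated along the following lines. This is a hedged sketch and not the authors' generator: zipf_workload is a hypothetical name, and the weight 1/rank^α is the standard Zipf form, assumed here because the text gives only the skew parameter.

```python
import random


def zipf_workload(queries, max_rank, n_requests, alpha, seed=None):
    """Draw n_requests queries from max_rank candidates with Zipf skew."""
    rng = random.Random(seed)
    # Select max_rank unique queries at random; their order in the
    # sample serves as the random rank assignment (rank 1 comes first).
    ranked = rng.sample(queries, max_rank)
    # Zipf weights: the query of rank r is drawn with weight 1 / r^alpha.
    weights = [1.0 / (rank ** alpha) for rank in range(1, max_rank + 1)]
    return rng.choices(ranked, weights=weights, k=n_requests)
```

With the parameters of the experiments, the call would be zipf_workload(dquery, 300, 1500, 0.223), where dquery stands for the list of generated direct queries.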

2) Metrics.: In the following experiments, we investigate 
our prototype's effectiveness in granting data access to au- 
thorized requests and denying unauthorized ones. We also 
measure its performance in terms of the time taken to fulfill 
authorized data requests. This is compared against the direct- 
query system, i.e. one without eXACML. We also provide 
quantitative analysis of the proxy, especially its caching and 
data joining features. 

3) Experiments and Results.: We first load nPolicies
unique policies onto the data server. The measured time is
reasonably small, with a mean of 0.034s and a standard deviation
of 0.016s per loading operation.

We then run two sets of experiments: 

1) The workload consists of nDirectQueries unique
queries and the corresponding unique requests. We enable
the caching and data joining options at the proxy in the
first run, and disable them in the second. To disable the cache,
we simply change the proxy configuration file. To run without the
joining option, we re-generate the workload without data
joining queries and requests. We measure the time taken
to fulfill direct queries and data requests.

2) The workload contains nRequests queries and the corresponding
requests, which follow the Zipf distribution.

In both experiments, non-matching requests are denied
access. Fig. 10 and Fig. 11 compare the performance of our
prototype against the direct-query system, using measurements
of matching requests. In both figures, a number of
requests take over 5s to finish. They are sliding window
requests, which translate into a large number of SELECT
queries to be executed by the database engines. Two factors
contribute to the noticeable delay: the server needs to wait for
and aggregate the results into a single client message, and the
JDBC implementation incurs non-negligible overhead for
executing each SELECT query.
Variable          Value                 Description
nDirectQueries    1000                  number of direct queries
directQueryDist   248:248:248:156:100   distribution of direct queries
                                        (selection : approximation : aggregation :
                                        sliding window : joining request)
nPolicies         900                   number of unique policies
nRequests         1500                  number of matching requests
α                 0.223                 skew parameter for Zipf distribution
maxRank           300                   maximum rank of unique requests from
                                        which Zipf distribution is generated

TABLE VI: Summary of parameters used in setting up experiments
Fig. 12: Benefit of caching on performance. Queries follow
Zipf distribution

Fig. 11: Overall performance with eXACML when the
joining and caching options are disabled



Fig. 10 illustrates eXACML's overhead when both the caching
and data joining options at the proxy are enabled. For unique
queries and requests, there is no overhead from the 99th
percentile. 80% of the requests incur less than 10% overhead.
The largest overhead is less than 0.4s and is observed
between the 87th and 90th percentiles. An interesting pattern in
which eXACML outperforms the direct-query system can be
seen at lower percentiles. Besides network and computational
variations, this can be attributed to the data joining feature at
the proxy (discussed later). For requests and queries following
Zipf distribution, eXACML performs better most of the time
(up until the 89th percentile). This is thanks to the caching
mechanism at the proxy, whose benefit is analyzed in
more detail later.



Fig. 11 shows how the overhead changes when the proxy
performs neither caching nor data joining. The overhead is
more discernible: for unique requests, the overhead starts from
the 20th percentile, as compared to the 45th percentile in Fig. 10.
Similarly, for queries following Zipf distribution, the overhead
is seen from the 10th percentile, as compared to the 89th
percentile in Fig. 10. This implies that caching and data joining
at the proxy are most effective when the query distribution is
heavy-tailed.

We proceed to analyze the benefits of caching at the proxy.
Request times for Zipf-distribution queries with and without
cache are extracted from the experiments and plotted in
Fig. 12. We show the results with and without data joining
queries. In both cases, caching results in better performance.
By itself, i.e. without the data joining feature, caching leads
to a 50% improvement for more than 80% of the requests. For
the workload including data joining queries, a similar pattern
can be seen, although the improvement is not as noticeable.

Finally, we analyze the benefit of the data joining feature at
the proxy. We run the same experiments as before, but with
workloads consisting of only data joining queries and requests.
The results shown in Fig. 13 are for both unique and
Zipf-distribution requests. It can be seen that eXACML outperforms
the direct-query system up until the 65th percentile for unique
queries and the 70th percentile for Zipf-distribution queries. This
is because for most requests, eXACML helps reduce the
data size substantially (by joining the results from two servers)
before transferring it back to the client. In contrast, without
eXACML, the client has to wait for all data to come back
individually before performing the join by itself. Notice that
some requests in eXACML still experience longer delays (after
the 70th percentile), because the extra communication between
client and proxy (as opposed to the direct communication between
client and server) and the computation overhead at the proxy are
not fully offset.

Fig. 13: Benefit of proxy performing data joins. All queries
require data joining

VI. Related work 

There exist cloud-based systems that enable data sharing
from multiple sources. SenseWeb IT261 . SensorBase [ 1 1 are 
examples of cloud services that let users upload and share 
their sensor data. They support coarse-grained access control 
model in which an user either makes its dataset public, shares 
it with a list of collaborators or keeps it private. Similarly, 
Google's Fusion Table 1 14J allows user to upload generic data 
and to perform simple analysis such as data visualization on 
the cloud. Recently, companies such as Okta ll22l have started 
implementing cloud-brokerage models that provide centralized 
service for management of enterprises' resources, including 
access control. However, these access control model is also 
coarse-grained, which means it cannot deal with the access 
scenarios we consider in this paper. In addition, data owners 
in these systems upload their datasets onto a centralized 
cloud, whereas our work does not make such assumption 
(we consider multi-cloud environment in which different data 
owner uses its own cloud provider). 

There are also numerous works focusing on access control
and data privacy on the cloud. Airavat [27], for example,
assumes the cloud is trusted in enforcing access control. It
uses a simple mandatory access control system available in
SELinux [1], and provides a trusted environment for executing
MapReduce [ 1 1 1 jobs while guaranteeing differential
privacy fl2l. Our work makes the same assumption about
clouds' trustworthiness, but aims at improving the access control
aspect of the system, which is complementary to Airavat.
Other works [29], 03], [23] assume the cloud is untrusted
and employ a cryptographic approach for access control. In [29],
data is encrypted with attribute-based encryption lfT3ll, [7] by
a proxy using a proxy re-encryption technique. Embedded
in the ciphertext are conditions that must be met when
decrypting. Plutus and CloudProof fl31, [23] use broadcast
encryption [19] to protect the data, while key management lfT31l
is done using key rolling and lazy revocation techniques.
These cryptographic approaches provide strong guarantees
for data security, but they cannot express fine-grained access
control policies as described in our work. Thus the focus in
these works is also complementary to ours. In addition, key
management and revocation protocols are complex and incur
much overhead in such an untrusted environment.

When multiple policies match in XACML, the conflict is usually
resolved by the top-level policy combining algorithm. XACML
supports only a limited number of combining algorithms. Li
et al. [20] and Rao et al. [25] propose formal languages for
expressing more fine-grained policy composition. These languages
can deal with evaluation errors and the combining of obligations.
Mazzoleni et al. ifTTll propose a method for combining policies
based on their similarity and users' preferences.
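
For readers unfamiliar with combining algorithms, the sketch below shows simplified versions of three algorithms defined by the XACML standard: deny-overrides, permit-overrides, and first-applicable. It is an illustration only. It deliberately omits the Indeterminate decision and obligations, whose handling is exactly where the standard algorithms are limited and where the works above add expressiveness, and the Python names are assumptions rather than part of any XACML library.

```python
from enum import Enum


class Decision(Enum):
    PERMIT = "Permit"
    DENY = "Deny"
    NOT_APPLICABLE = "NotApplicable"


def deny_overrides(decisions):
    """Any Deny wins; otherwise any Permit wins; otherwise NotApplicable."""
    decisions = list(decisions)
    if Decision.DENY in decisions:
        return Decision.DENY
    if Decision.PERMIT in decisions:
        return Decision.PERMIT
    return Decision.NOT_APPLICABLE


def permit_overrides(decisions):
    """Any Permit wins; otherwise any Deny wins; otherwise NotApplicable."""
    decisions = list(decisions)
    if Decision.PERMIT in decisions:
        return Decision.PERMIT
    if Decision.DENY in decisions:
        return Decision.DENY
    return Decision.NOT_APPLICABLE


def first_applicable(decisions):
    """Return the first decision that is not NotApplicable, in policy order."""
    for decision in decisions:
        if decision is not Decision.NOT_APPLICABLE:
            return decision
    return Decision.NOT_APPLICABLE
```

The same set of individual policy decisions can thus yield different final outcomes depending on which algorithm sits at the top level.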

Time-series data, similar to those considered in our paper,
may arrive at the system in continuous streams, for which
relational databases such as MySQL and PostgreSQL are not
ideal. Aurora is a popular data stream management system
that addresses limitations of relational databases when it comes
to stream data. Carminati et al. [9], [8] are among the first
to propose a model and implementation of access control for
data streams based on Aurora. The model supports four access
scenarios: column-based, value-based, general window and
sliding window. Our framework supports all of these scenarios
for on-demand queries over archival databases. The extension
to eXACML that deals with continuous queries over stream
databases is left for future work.
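
As a rough illustration of such access scenarios on archival time-series data, the sketch below applies a column-based restriction, a value-based restriction, and a sliding-window aggregate to a list of rows. It is not the eXACML or Aurora implementation: the row format, the function names, and the choice of the mean as the window aggregate are assumptions made for this example, and the exact window semantics of the cited model may differ.

```python
def column_filter(rows, allowed_columns):
    """Column-based access: expose only the permitted columns of each row."""
    return [{c: row[c] for c in allowed_columns if c in row} for row in rows]


def value_filter(rows, column, predicate):
    """Value-based access: expose only rows whose value satisfies the predicate."""
    return [row for row in rows if predicate(row[column])]


def sliding_window_avg(rows, column, size, step=1):
    """Sliding-window access: expose only per-window means, not the raw values."""
    values = [row[column] for row in rows]
    return [sum(values[i:i + size]) / size
            for i in range(0, len(values) - size + 1, step)]


if __name__ == "__main__":
    readings = [{"ts": i, "rain": float(i), "station": "s1"} for i in range(5)]
    print(column_filter(readings, ["ts", "rain"]))
    print(value_filter(readings, "rain", lambda v: v >= 3))
    print(sliding_window_avg(readings, "rain", size=3))
```

With `step` equal to `size` the windows do not overlap, and with a smaller `step` consecutive windows share readings; in both cases the requester sees only aggregated values.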

VII. Future work 

We have implemented a simple prototype and carried out
a preliminary evaluation of our framework. The next step would
be to improve the prototype and perform more comprehensive
evaluations. More specifically, the cloud-like environment set
up in the experiment contains only two data servers. In
addition, only one dataset comes from real monitoring stations,
and the workloads are synthetic. Therefore, we plan to acquire
more realistic datasets and workloads, and to evaluate the
prototype with larger numbers of data servers. We also plan
to port our prototype to real cloud environments such as
Amazon EC2 and Microsoft's Azure |@), lfl8l, and benchmark
it with real data mining applications accessing real datasets.

We assumed that each dataset is guarded by an independent 
XACML* instance. We have acknowledged the trade-offs 
in having multiple datasets sharing one XACML* instance, 
especially when datasets reside on the same physical machine.
Another trade-off is the number of proxy servers. It would be 
interesting to investigate these trade-offs further by extending 
the framework with XACML* sharing and distributed proxies. 

As shown in Table I, eXACML only deals with archival
databases and queries. The immediate extension will be to
support stream databases and continuous queries. Relational
databases are not the best tool for handling stream data,
for which other models have been proposed [|2||. We will
examine the design and compare performance of the extended 
eXACML to that of the existing works on access control for 
stream data 0, 0. 

Regarding data sharing, access control only addresses the 
problem of authorization. We have so far made an assumption 
that authentication is implicit, that is, clients are given static 
credentials and the servers always accept the given credentials. 
We plan to incorporate an authentication model into our 
framework. It is an interesting challenge in decentralized 
settings, of which our multi-cloud scenario is an example, 
since authentication may depend not only on static credentials 
but also on previous interactions between parties and the states 
of the entire system. Authentication is also important when
the cloud provider has to log and notify data owners of access 
to their data (for billing purposes, for example). We plan 
to use other access control languages such as DynPal 
or SecPal 0, because they are more suitable for handling 
dynamic authentication than XACML. 

Finally, we have always assumed the cloud is trusted to
enforce access control policies and not to violate users' data
security and privacy. However, users with sensitive data or data
that have been expensive to collect will demand the highest level
of security. As a consequence, they cannot assume the cloud
is trusted in handling their data. Existing works have taken
a cryptographic approach that encrypts data and attempts
to outsource the key management to the cloud. Nevertheless,
the range of access control policies supported by the existing
systems has been limited. For future work, we aim to find
practical cryptographic protocols that can handle more
fine-grained access control scenarios. Since eXACML contains two
components belonging to third parties, namely the proxy server
and the cloud servers, we will investigate relaxing the trust
assumption for these components one by one.

VIII. Conclusion 

In this paper, we have proposed a framework (eXACML)
that allows users to share their data on the cloud in a secure,
flexible, easy-to-use and scalable manner. We considered a
trusted cloud environment, in which data are maintained in
relational databases. The cloud environment makes it easy for
data owners to share and benefit from mining the aggregated
data. The main challenge is how to let users control access
to their data in the most flexible ways. We achieved security
and flexibility by extending the XACML framework, allowing
users to specify fine-grained access control policies. Our
framework contains a proxy server residing between clients
and the cloud servers. It processes requests from the clients,
and joins and caches responses from the servers before sending
them back to the client. We have implemented a prototype and
carried out preliminary experiments to evaluate its performance.
The results suggest that the framework is scalable,
as the overhead incurred is small, thanks to the caching and
data joining features at the proxy. In addition, the prototype
provides a graphical user interface that lets users share and
manage their data in an easy-to-use manner.



We believe that in order to take full advantage of cloud 
computing, having a framework such as ours is very important. 
Our paper has taken the first steps towards realizing a practical 
and usable sharing-friendly cloud environment. We have also 
identified many avenues for future work, such as improving 
scalability with more proxies, adding support for stream data 
and other policy languages, and relaxing assumptions on the 
trustworthiness of the cloud. 